Abstract



This work is motivated by the problem of image mis-registration in remote
sensing, and we are interested in determining the resulting loss in the
accuracy of pattern classification. A statistical formulation is given in
which we propose to use data contamination to model the phenomenon of image
mis-registration. This model is applicable to many other types of errors as
well, for example, measurement errors and gross errors. The impact of data
contamination on classification is studied under a statistical learning
theoretical framework. A closed-form asymptotic bound is established for the
resulting loss in classification accuracy, which is less than
$\epsilon/(1-\epsilon)$ for data contamination of an amount of $\epsilon$.
Our bound is sharper than similar bounds in the domain adaptation literature
and, unlike such bounds, it applies to classifiers with an infinite VC
dimension. Extensive simulations have been conducted on both synthetic and
real datasets under various types of data contamination, including label
flipping, feature swapping and the replacement of feature values with data
generated from a random source such as a Gaussian or Cauchy distribution. Our
simulation results show that the bound we derive is fairly tight.



1 Introduction 

A motivating example of this work is the problem of image mis-registration 
which occurs almost ubiquitously in remote sensing. Image mis-registration 
refers to the phenomenon where the image of interest is mapped or aligned to a 
wrong position. This is usually caused by errors in the image or data acquisition 
device or the inaccuracy of the underlying mapping algorithms which try to 
map data collected at different scales, at different times, or taken from different 
angles. Figure 1 below illustrates an instance of image mis-registration where
the image is tilted and then shifted by a small amount. 




Figure 1: The original (left) and the mis-registered (right) remote sensing images 
for a cropland. Each color corresponds to one land class. 



The problem of image registration is of primary importance in remote sensing
land monitoring applications, which typically require a number of images
acquired at different times, or time sequence data that can characterize
seasonal changes or multi-annual similarities (Defries and Townshend, 1999;
Liu et al., 2006). This demands image registration and can affect such
applications as image classification, change detection, and
ecological/climatological/hydrological modeling (Justice et al., 1998; Gong
and Xu, 2003). Because image registration can never be perfect, a
mis-registration error is inevitable. It has been suggested that
mis-registration errors of less than 0.5 pixels are acceptable in subsequent
analysis (Gong et al., 1992; Townshend et al., 1992; Jensen, 2004). However,
this is rarely achievable and it is thus important to assess the impact of
image mis-registration.

Of a similar nature are errors due to rounding or the inaccuracy of the
measuring instruments. In addition, interference from electromagnetic waves,
clouds or other unfavorable weather conditions can all introduce errors into
remote sensing images. Various types of human error also factor in, where a
small amount of arbitrary error may be introduced anywhere in the data, or any
part of the data can be missing. Errors of this type are often called gross errors,
and are estimated to occur in about 0.1% to 10% of the data [l^. This estima- 
tion of the amount of errors will form the basis for our choice on the amount of 
data contamination in our simulation. 

We refer to the errors discussed above broadly as data contamination. Data
contamination can have a disastrous effect on data quality and may
fundamentally impact subsequent analysis and inference. It is thus of
significant practical importance to answer the following questions. How much
does data contamination impact our analysis (classification)? Do current
algorithms (classifiers) continue to work, and how much do we lose in accuracy
if a remote sensing image is mis-registered or the underlying data are
contaminated? The present work aims to shed light on these questions. To gain
insight into the nature of data contamination, in particular the phenomenon of
image mis-registration, it is highly desirable to approach the problem with a
formal model and to give some theoretical characterization. This forms the
primary motivation of the present work. Our focus will be on classification.






Assume the data of interest are drawn i.i.d. from some probability
distribution $G$ defined on $\mathbb{R}^d$. By treating errors as
contaminations to the probability distribution $G$, we arrive at the following
statistical model for data contamination

    $\tilde{G} = (1 - \epsilon) G + \epsilon H$    (1)

where $\tilde{G}$ is the distribution of the data after contamination and $H$
is an arbitrary distribution. Model (1) is quite general; it captures the
various types of data contamination discussed above (though not additive
noise). Note that, in the setting of classification, $G$ is the joint
distribution of the attributes and the label, so a contamination under model
(1) can affect the attributes, the label, or both. The $\epsilon$ in (1) can
be thought of as the proportion of data (e.g., image pixels) that are
"contaminated", e.g., flipped in label or replaced with data generated under
a different distribution $H$.
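
As an illustration of model (1), the following minimal sketch draws a sample in
which each observation comes from a contaminating source with probability
$\epsilon$ and from the clean source otherwise. The function name
`sample_contaminated` and the two example samplers (a standard Gaussian for $G$
and a standard Cauchy for $H$) are assumptions made for the illustration only;
they are not taken from the experiments in this paper.

```python
import numpy as np

def sample_contaminated(sample_g, sample_h, n, eps, rng):
    """Draw n points from the mixture (1 - eps) * G + eps * H."""
    from_h = rng.random(n) < eps            # True where the draw comes from H
    data = sample_g(n, rng)                 # start with n draws from G
    n_h = int(from_h.sum())
    if n_h > 0:
        data[from_h] = sample_h(n_h, rng)   # overwrite the contaminated rows
    return data, from_h

# Hypothetical clean source G and contaminating source H (illustration only).
def g_sampler(n, rng):
    return rng.normal(0.0, 1.0, size=(n, 2))

def h_sampler(n, rng):
    return rng.standard_cauchy(size=(n, 2))

rng = np.random.default_rng(0)
x, mask = sample_contaminated(g_sampler, h_sampler, 10000, 0.05, rng)
print(x.shape, mask.mean())                 # fraction from H is close to eps
```

The returned mask records which observations were contaminated, which is not
observable in practice but is convenient when checking a simulation.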

It is known that the effect of image mis-registration is determined by
resolution, scene structure and the amount of registration error (e.g., an
RMS error of 0.5, 1 or 1.5 pixels). In model (1), we choose to use the
proportion of pixels that are "contaminated" as a measure of the extent of
image mis-registration. This captures the essence of image mis-registration
and helps uncover the relationship between the amount of mis-registration and
the resulting loss in classification accuracy. It differs from the usual
practice in the remote sensing community, where image mis-registration is
quantified in terms of a shift of a certain number of pixels. Given the same
amount of shift, the impact on classification is highly scene-dependent; e.g.,
the impact would be drastically different for a large area consisting mainly
of forests and a small land parcel formed by corn fields and rice fields.
Under a shift-based measure it would thus hardly be possible to establish a
generic relationship between the amount of mis-registration and the resulting
loss in classification accuracy.

Our contributions are as follows. We propose a statistical model for the
phenomenon of image mis-registration. This data contamination model captures
a wide range of errors, such as label flipping, measurement errors, rounding
errors and accidental human errors, which occur almost ubiquitously in real
applications. We study classification under data contamination in the
statistical learning framework. We obtain a bound, which we term the data
contamination bound, on the loss of classification accuracy due to
contamination of the training data, expressed in terms of the amount of
contamination. This bound allows one to give a conservative assessment of
whether a class of classification algorithms, namely those that are
universally consistent, continue to work under data contamination.

The rest of the paper is organized as follows. In Section 2 we formulate the
problem of classification under data contamination and obtain a bound on the
loss in classification accuracy in terms of the amount of data contamination.
This is followed by a discussion of related work in statistics, remote sensing
and machine learning in Section 3, where in particular we compare various
aspects of our bound with the finite-sample bounds established in the recently
emerging area of domain adaptation. In Section 4 we conduct extensive
simulations on the impact of data contamination on the classification
performance of SVM for a number of synthetic and real datasets under various
types of data contamination. In Section 4.5 we briefly discuss heuristics to
estimate the amount of data contamination for the case of image
mis-registration. Finally we conclude in Section 5. In that section, we also
collect results from the literature on the impact of label flipping on the
classification performance of AdaBoost; additionally, we give insight on using
data contamination as a model to understand co-training, which is particularly
useful in situations where training data are scarce.

2 Classification under data contamination 

Classification is an important problem in pattern recognition. However, as
discussed in Section 1, especially in the context of land-cover and land-use
mapping, crop yield estimation and many other important applications in remote
sensing, the classification result may be affected by data contamination. In
this section, we study classification under data contamination with model (1)
and derive a bound on the resulting loss in classification accuracy. We start
with an introduction to the statistical learning framework for classification
[15].

2.1 Classification in the statistical learning framework 

In statistical learning, a classification rule (or classifier) is defined by a
map $\mathcal{X} \to \mathcal{Y}$, where $\mathcal{X}$ is the sample space for
observations and $\mathcal{Y}$ is a finite set of labels. For simplicity, we
consider throughout a two-class problem where $\mathcal{Y} = \{0, 1\}$.

Associated with each classifier, there is a performance measure called the
loss function, denoted by $l(f, X, Y)$. The loss function of special interest
is the 0-1 loss, defined as

    $l(f, X, Y) = \begin{cases} 0 & \text{if } I_{\{f(X) > 0\}} = Y \\ 1 & \text{otherwise} \end{cases}$    (2)

where $f$ is a decision function and $I$ is the indicator function. Here we
call a function $f$ a decision function if a decision rule can be written as
$I_{\{f > 0\}}$.

Definition. Let $P$ be the joint probability distribution of $X$ and $Y$.
Then the risk associated with a decision function $f$ is defined as

    $R_P(f) = E_P\, l(f, X, Y) = P(Y \neq I_{\{f(X) > 0\}})$.    (3)

Similarly, the empirical risk for a decision function $f$ on a training
sample $(X_1, Y_1), \ldots, (X_n, Y_n)$ can be obtained by replacing $P$ in
the above with its empirical distribution $P_n$.

Fix a probability distribution $P$ and a function class $\mathcal{G}$. The
goal of classification is to find a decision rule
$f^*_{\mathcal{G}} \in \mathcal{G}$ that minimizes $R_P(f)$, i.e.,

    $f^*_{\mathcal{G}} = \arg\min_{f \in \mathcal{G}} R_P(f)$.    (4)

The rule learned from the training sample $(X_1, Y_1), \ldots, (X_n, Y_n)$,
denoted by $f_n$, can be defined similarly by substituting $P_n$ for $P$ in
(4).
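
To make the empirical version of the risk in (3) concrete, here is a small
sketch that evaluates the empirical 0-1 risk of the rule $I_{\{f(x) > 0\}}$ on
a labeled sample with labels in {0, 1}. The helper name `empirical_risk` and
the toy data are hypothetical and serve only as an illustration.

```python
import numpy as np

def empirical_risk(f, X, y):
    """Empirical 0-1 risk of the decision rule I{f(x) > 0}; y takes values in {0, 1}."""
    pred = (f(X) > 0).astype(int)       # decision rule induced by f
    return float(np.mean(pred != y))    # fraction of misclassified observations

# Toy one-dimensional sample (illustration only).
X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y = np.array([0, 1, 1, 1])

def f(X):
    return X[:, 0]                      # decision function: sign of the first coordinate

risk = empirical_risk(f, X, y)
print(risk)                             # one of four points is misclassified: 0.25
```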

Definition. Fix a probability distribution $P$. The function that achieves
the minimum risk, among all possible decision rules, is called the Bayes rule.
The corresponding risk is called the Bayes risk and is denoted by $R_P^*$.

For the 0-1 loss as defined in (2) and a fixed probability distribution, the
Bayes rule is given by

    $I_{\{\eta(x) > 0\}}$

where

    $\eta(x) = P(Y = 1 \mid X = x) - 0.5$

is called the Bayes decision function.

Definition. A classification algorithm is universally consistent if, for all
distributions $P$,

    $R_P(f_n) \to R_P^*$  a.s.

as $n \to \infty$, where a.s. stands for almost surely.

Notation. To simplify notation, we adopt the following convention. Denote
$R = R_G$ and $\tilde{R} = R_{\tilde{G}}$. We also use a tilde to indicate a
quantity associated with the contaminated distribution $\tilde{G}$. In
particular, $f_n$ and $\tilde{f}_n$ are the classifiers learned from a
training sample of size $n$ from $G$ and $\tilde{G}$, respectively; and
$\eta$, $\tilde{\eta}$ and $\eta^H$ are the Bayes decision functions under
$G$, $\tilde{G}$ and $H$, respectively.

2.2 A bound on the loss of classification accuracy 

In the standard setting of statistical learning theory, one is interested in
the consistency of a classifier $f_n$ obtained via empirical risk
minimization, that is,

    $R(f_n) \to R^*$

as $n \to \infty$. In such a case, the classifiers $f_n$ are trained and
tested with data generated from the same probability distribution $G$.

In the present work, we consider a different setting where the probability
distribution, $\tilde{G}$, of the training sample differs from that of the
test sample, $G$. Of course if $G$ and $\tilde{G}$ are "totally" different,
then there is no hope of learning. We thus make the assumption that $G$ and
$\tilde{G}$ differ by a small amount, in the sense of a "small" $\epsilon$
under model (1). Clearly the rule learned from a training sample under
$\tilde{G}$ will be different from that under $G$. Since the test sample is
from $G$, a classifier trained under $\tilde{G}$ would typically have a larger
classification error. One important question is how much additional
classification error is introduced if the classifier is trained on a sample
from $\tilde{G}$ (instead of $G$) and tested on a sample generated from $G$.

What we wish to know is how much $R(\tilde{f}_n)$ differs from $R(f_n)$ as
$n \to \infty$ for small $\epsilon$. As we do not have access to data from
$G$, a natural proxy for $R(f_n)$ is $R^*$, since $R(f_n) \to R^*$ as
$n \to \infty$ for consistent classifiers $f_n$. We start with the following
risk decomposition

    $R(\tilde{f}_n) - R^* = R(\tilde{f}_n) - R(\tilde{\eta}) + R(\tilde{\eta}) - R^*$.    (5)

The $R(\tilde{\eta}) - R^*$ term in (5) can be bounded by a term that depends
only on the amount of contamination, $\epsilon$, under some weak assumptions.
This is stated as Theorem 1. The term $R(\tilde{f}_n) - R(\tilde{\eta})$ can
be shown to vanish as the training sample size increases if the underlying
classifier is universally consistent. This is stated as Theorem 2. Note that
the convergence rate may be different for different types of classifiers.

Theorem 1. If $g(x)$, the probability density function of $G$, exists, then
for data contamination with any distribution $H$,

    $R(\tilde{\eta}) - R^* \le \frac{\epsilon}{1 - \epsilon}$

where the equality holds if and only if the following are true:

a) $\epsilon = \frac{0.5 - R^*}{1 - R^*}$, $h(x) = \frac{|\eta(x)|\, g(x)}{E|\eta(X)|}$, and

b) $P_H(Y = 1 \mid X = x) = 1$ when $\eta(x) < 0$, and $0$ otherwise.
Remark.

1. The bound as stated in Theorem 1 is sharp, as it is achieved in the
   special case noted in the statement of the theorem.

2. A related data contamination model is as follows.

    $d\tilde{G}(x) = [1 - \epsilon(x)]\, dG(x) + \epsilon(x)\, dH(x)$    (6)

   such that $0 \le \epsilon(x) \le \epsilon < 1$ for some positive constant
   $\epsilon$, where $G$, $H$, $\tilde{G}$ are probability distribution
   functions. Model (6) allows the amount of data contamination to be data
   dependent as long as the amount is uniformly smaller than a constant. A
   result similar to Theorem 1 can be obtained.

To prepare for the proof of Theorem 1 we have the following lemma.

Lemma 1. Let $f$ be a decision function. Further assume $P(f(X) = 0) = 0$.
Then

    $R(f) = 0.5 - E[\eta(X) \cdot \mathrm{sign}(f(X))]$

where $\mathrm{sign}(x) = 1$ if $x > 0$ and $-1$ otherwise.

Proof. Note that we can write

    $R(f) = E_G\, |Y - I_{\{f(X) > 0\}}|$.

Thus

    $R(f) = E[Y \cdot I_{\{f(X) < 0\}}] + E[(1 - Y) \cdot I_{\{f(X) > 0\}}]$
          $= 0.5 + E[(Y - 0.5) \cdot I_{\{f(X) < 0\}}] + E[(0.5 - Y) \cdot I_{\{f(X) > 0\}}]$
          $= 0.5 - E[\eta(X) \cdot \mathrm{sign}(f(X))]$.

□
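
As a quick sanity check of Lemma 1, the sketch below compares the risk computed
directly from its definition with the expression
$0.5 - E[\eta(X) \cdot \mathrm{sign}(f(X))]$ on a small discrete distribution.
The three-point distribution and the values of the decision function are
hypothetical numbers chosen only for illustration.

```python
import numpy as np

# Hypothetical discrete example: X takes three values.
p = np.array([0.2, 0.5, 0.3])       # P(X = x)
q = np.array([0.9, 0.4, 0.1])       # P(Y = 1 | X = x)
f = np.array([1.0, 2.0, -1.5])      # decision function values, never zero
eta = q - 0.5                       # Bayes decision function eta(x)

# Risk from the definition: an error occurs when Y = 1 and the rule says 0,
# or when Y = 0 and the rule says 1.
pred = (f > 0).astype(float)
risk_direct = np.sum(p * (q * (1 - pred) + (1 - q) * pred))

# Risk from Lemma 1.
risk_lemma = 0.5 - np.sum(p * eta * np.sign(f))

print(risk_direct, risk_lemma)      # both equal 0.35 up to rounding
```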

The posterior probability $\tilde{\eta}(x) + 0.5$ under the contaminated
distribution $\tilde{G}$ can be written as

    $\tilde{\eta}(x) + 0.5 = [1 - \alpha_\epsilon(x)](\eta(x) + 0.5) + \alpha_\epsilon(x)(\eta^H(x) + 0.5)$

where

    $\alpha_\epsilon(x) = \epsilon h(x) [(1 - \epsilon) g(x) + \epsilon h(x)]^{-1}$.

Here $g$ and $h$ are the continuous density or discrete probability functions
corresponding to $G$ and $H$, respectively. Then

    $\tilde{\eta} = (1 - \alpha_\epsilon)\, \eta + \alpha_\epsilon\, \eta^H$.
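Proof of Theorem 1. By Lemma 1 we have

    $R(\tilde{\eta}) - R^* = E[\eta\, (\mathrm{sign}(\eta) - \mathrm{sign}(\tilde{\eta}))] = 2 E[|\eta| \cdot I_{\{\eta \tilde{\eta} < 0\}}]$.

Next notice that, if

    $\eta \tilde{\eta} < 0$,

then, since $|\eta^H| \le 0.5$, this implies

    $2|\eta| \le \frac{\alpha_\epsilon}{1 - \alpha_\epsilon} = \frac{\epsilon}{1 - \epsilon} \frac{h(x)}{g(x)}$.

Hence,

    $R(\tilde{\eta}) - R^* \le 2 E\left[ |\eta(X)| \cdot I_{\{2|\eta(X)| \le \frac{\epsilon}{1 - \epsilon} \frac{h(X)}{g(X)}\}} \right]$    (7)

    $\le \frac{\epsilon}{1 - \epsilon} E\left[ \frac{h(X)}{g(X)} \right]$    (8)

    $= \frac{\epsilon}{1 - \epsilon}$.

The equality in (7) holds if and only if
$\eta^H = -\frac{1}{2}\mathrm{sign}(\eta)$, or,
$P_H(Y = 1 \mid X = x) = 1$ when $\eta(x) < 0$, and $0$ otherwise, i.e., for
the same observation $X = x$, the worst rule under $H$ assigns a completely
opposite class membership w.r.t. that under $G$. Further, the equality in (8)
holds if and only if

    $2|\eta(x)| = \frac{\epsilon}{1 - \epsilon} \frac{h(x)}{g(x)}$,

which implies

    $2 E|\eta| = \frac{\epsilon}{1 - \epsilon}$

since $\int h(x)\, dx = 1$. Thus,

    $\epsilon = \frac{2 E|\eta|}{1 + 2 E|\eta|} = \frac{0.5 - R^*}{1 - R^*}$

by Lemma 1. This concludes the proof.

□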
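
The quantities appearing in Theorem 1 can be evaluated exactly on a small
discrete example. The sketch below computes
$R(\tilde{\eta}) - R^* = 2E[|\eta| \cdot I_{\{\eta\tilde{\eta} < 0\}}]$ from the
mixing weight $\alpha_\epsilon$ and compares it with $\epsilon/(1-\epsilon)$.
The four-point distribution, the choice of $h$ and the adversarial $\eta^H$ are
hypothetical values picked for illustration; the function name `excess_risk` is
likewise an assumption.

```python
import numpy as np

def excess_risk(g, eta, h, eta_h, eps):
    """Exact R(eta_tilde) - R* for discrete probability functions g and h."""
    alpha = eps * h / ((1 - eps) * g + eps * h)     # mixing weight alpha_eps(x)
    eta_tilde = (1 - alpha) * eta + alpha * eta_h   # contaminated decision function
    flipped = eta * eta_tilde < 0                   # where the Bayes decision changes
    return 2 * np.sum(g * np.abs(eta) * flipped)

# Hypothetical example: contamination concentrates on the two points
# closest to the decision boundary and always votes for the opposite class.
g = np.array([0.25, 0.25, 0.25, 0.25])
eta = np.array([0.4, 0.02, -0.02, -0.4])
h = np.array([0.0, 0.5, 0.5, 0.0])
eta_h = -0.5 * np.sign(eta)
eps = 0.05

loss = excess_risk(g, eta, h, eta_h, eps)
bound = eps / (1 - eps)
print(loss, bound)      # the loss (0.02) stays below the bound (about 0.0526)
```

In this example only the two near-boundary points change their decision, so the
loss is well below the bound; the bound is attained only under the specific
conditions a) and b) of the theorem.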



Theorem 2. Suppose a classification algorithm is universally consistent. Then,
under data contamination model (1), we have

    $R(\tilde{f}_n) \to R(\tilde{\eta})$

as $n \to \infty$.

The proof of Theorem 2 relies on the following lemma.

Lemma 2. Assume $P(\eta(X) = 0) = 0$. If $R(f_n) \to R^*$, then the decision
induced by $f_n$ converges to the Bayes rule in probability as $n \to \infty$.
Remark. Theorem 2 of Bartlett and Tewari [3] implies that the decision
rule given by SVM converges to the Bayes rule. Lemma 2 is more general in
that it applies to all consistent rules.

Proof. Without loss of generality, assume the decision function $f_n$ is
already centered, i.e., the corresponding decision rule can be written as
$I_{\{f_n > 0\}}$. From Lemma 1 we have

    $R(f_n) = 0.5 - E[\eta(X) \cdot \mathrm{sign}(f_n(X))]$.

Let

    $\xi_n(x) = |\mathrm{sign}(\eta(x)) - \mathrm{sign}(f_n(x))|$,

then $\xi_n(x)$ takes two values $\{0, 2\}$. We have

    $R(f_n) - R^* = E\{\eta(X)\, [\mathrm{sign}(\eta(X)) - \mathrm{sign}(f_n(X))]\}$
                  $= E[|\eta(X)| \cdot \xi_n(X)]$.

Since $|\eta(X)| > 0$ almost surely, $P(\xi_n(X) = 2) \to 0$ by the assumption
$R(f_n) \to R^*$ as $n \to \infty$. That is, $I_{\{f_n(x) > 0\}}$ converges to
$I_{\{\eta(x) > 0\}}$ in probability as $n \to \infty$. □



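Proof of Theorem 2. By universal consistency and Lemma 2 we have

    $\int I_{\{\mathrm{sign}(\tilde{f}_n(x)) \neq \mathrm{sign}(\tilde{\eta}(x))\}}\, d\tilde{G}(x) \to 0$.

Thus, since $(1 - \epsilon)\, dG \le d\tilde{G}$ under model (1),

    $\int I_{\{\mathrm{sign}(\tilde{f}_n) \neq \mathrm{sign}(\tilde{\eta})\}}\, dG \to 0$,

implying that, as $n \to \infty$,

    $R(\tilde{f}_n) \to R(\tilde{\eta})$.

□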

By risk decomposition (5) together with Theorem 1 and Theorem 2, we arrive at
a sharp asymptotic data contamination bound

    $R(\tilde{f}_n) - R^* \le \frac{\epsilon}{1 - \epsilon} + O(c(n))$    (9)

where $c(n) = R(\tilde{f}_n) - R(\tilde{\eta})$ indicates the rate of
convergence, with $c(n) \to 0$ as $n \to \infty$.

Bound (9) implies that, when the amount of data contamination is "small",
i.e., $\epsilon \to 0$, we can make

    $|R(\tilde{f}_n) - R^*| \to 0$.

That is, as long as a classifier is consistent in the standard setting and the
amount of contamination is small in the sense of a small $\epsilon$, this
classifier suffers very little from data contamination. This explains why,
empirically, classifiers such as SVM work well even when a small fraction of
labels are randomly flipped.
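
To give a sense of scale, the leading term $\epsilon/(1-\epsilon)$ of the
asymptotic bound can be tabulated for the contamination levels used later in
the simulations. This is a minimal sketch; the function name
`contamination_bound` is hypothetical and the finite-sample term $O(c(n))$ is
ignored.

```python
def contamination_bound(eps):
    """Leading term eps / (1 - eps) of the asymptotic data contamination bound."""
    if not 0.0 <= eps < 1.0:
        raise ValueError("eps must lie in [0, 1)")
    return eps / (1.0 - eps)

eps_grid = [0.01, 0.02, 0.03, 0.04, 0.05, 0.10]
bounds = [contamination_bound(e) for e in eps_grid]
for e, b in zip(eps_grid, bounds):
    print(f"eps = {e:.2f}  bound = {b:.4f}")
```

For small $\epsilon$ the bound is only slightly larger than $\epsilon$ itself.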

Theorem [2] relies on the universal consistency of a classifier. Fortunately, 
several of the currently most popular classifiers are universally consistent, for 
example, SVM [33] and AdaBoost with early stopping [J.
3 Related work



The study of data analysis and statistical inference under data contamination 
has been a long-standing research topic in statistics and machine learning. The 
earliest work can be traced back to at least a half century ago, see, for example, 
Tukey [37] for a survey on sampling from contaminated distributions. Extensive
studies have been carried out since under the name of robust estimation ([241 
mj), measurement error model ( [TSl IHl HI] ) etc. However, work along this line 
concerns primarily problems on regression or estimation. 

Relevant literature in remote sensing, however, has been sparse. Swain et
al. [35] investigated the impact of image mis-registration on classification.
However, this work is purely empirical and the results depend highly on the
underlying scenes in the image; for example, even under the same amount of
mis-registration, the impact would be considerably different on images formed
primarily by large forest lands and on those formed by many small patches of
different land types, such as corn and other crops. Additionally, Townshend
et al. [36] considered the impact of image mis-registration on change
detection. Xu et al. [39] studied parameter estimation for a simple linear
model under measurement errors due to a mismatch of locations and scales.

Related machine learning literature is much richer. Such work can be broadly 
divided into two stages. The first stage, roughly before year 2005, mostly deals 
with data contamination in the form of label flipping and empirical study of 
its impact on the performance of various classifiers. This includes Dietterich 
[IB] and Breiman [S] which evaluate the robustness of learning algorithms such 
as bagging, AdaBoost and Random Forests against label flipping. Other work 
includes ( j^Ul [5^ HI] ) and references therein. The second or the current stage, 
which is closely related to the present work, deals with domain adaptation.
Domain adaptation is a broader concept than data contamination in that it does
not specify explicitly the nature of the difference between the source (or
training) distribution and the target (or test) distribution, as long as their
difference is small, whereas data contamination almost exclusively refers to
model (1). Numerous papers have been published on domain adaptation, covering
applications, theory and methods, and it is beyond the scope of the present
paper to give a detailed account here. The work closest to ours includes
[5, 28] (see also references therein). In particular, Ben-David et al. [5]
established the following bound.

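Theorem 3 ([5]). Let $\mathcal{H}$ be a hypothesis space of VC dimension $d$.
If $U_S$, $U_T$ are unlabeled samples of size $m'$ each, drawn from the source
distribution $\mathcal{D}_S$ and the target distribution $\mathcal{D}_T$
respectively, then for any $\delta \in (0, 1)$, with probability at least
$1 - \delta$ (over the choice of the samples), for every $h \in \mathcal{H}$,
the difference between the error rates $\epsilon_S$ and $\epsilon_T$ satisfies

    $\epsilon_T(h) - \epsilon_S(h) \le \frac{1}{2} \hat{d}_{\mathcal{H}\Delta\mathcal{H}}(U_S, U_T) + 4 \sqrt{\frac{2d \log(2m') + \log(2/\delta)}{m'}} + \lambda$

where $\lambda$ is defined by

    $\lambda = \min_{h \in \mathcal{H}} [\epsilon_S(h) + \epsilon_T(h)]$

with the subscripts $S$, $T$ indicating quantities related to the source and
target, respectively.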
The bound established in [28] is similar in nature; it replaces the VC
dimension in [5] with the Rademacher complexity [2]. However, there are
important differences between the bound in Theorem 3 (or that in [28]) and
ours (i.e., Theorem 1).

(1) The nature of the bounds is different. The bounds in [5, 28] are
    finite-sample generalization bounds, while our bound is a large-sample
    (i.e., asymptotic) bound.

(2) The quality of the bounds is different. The bounds in [5] are union bounds
    that rely on the Vapnik-Chervonenkis (VC) dimension [38], and are often
    quite loose ([28] uses the Rademacher complexity but is still quite
    loose). In contrast, our bound is asymptotically sharp. Assume the
    underlying function class has a finite VC dimension and let
    $m' \to \infty$; then the bound in Theorem 3 becomes
    $\epsilon + \lambda$, which is looser than our bound
    $\epsilon/(1 - \epsilon) \approx \epsilon$ for small $\epsilon$. Since the
    $\lambda$ term depends on the difficulty of the underlying problem and
    generally does not vanish, in no way would the bounds in [5] imply ours.

    To better appreciate the difference in the quality of the bounds when
    the sample size increases, we show an example where the data are
    generated by a two-component Gaussian mixture and contaminated by
    Cauchy data (see Section 4 for details on the Gaussian mixture and the
    Cauchy data). Since it is not easy to compute $\lambda$ directly, we
    replace it with its lower bound
    $\min_{h \in \mathcal{H}} \epsilon_S(h) + \min_{h \in \mathcal{H}} \epsilon_T(h)$,
    whose two terms are estimated as the error rates of SVM on the data when
    the training sample size is large. Figure 2 shows our asymptotic data
    contamination bound and that established in [5] for the amount of data
    contamination varying over {0.01, 0.02, 0.03, 0.04, 0.05, 0.10}. One can
    see that here the Ben-David et al. bound is much looser than ours, and
    for this particular Gaussian mixture data, the Ben-David et al. bound is
    not very informative as it quickly approaches 0.5.



[Figure 2 plot. Legend: our bound; bound of Ben-David et al. Horizontal axis:
amount of data contamination (Gaussian mixture).]

Figure 2: Comparison of data contamination bounds for Gaussian mixture data
with $\epsilon \in \{0.01, 0.02, 0.03, 0.04, 0.05, 0.10\}$.
(3) Whereas our bound applies only to universally consistent classifiers, the
    bound in [5] applies only to classifiers from a function class with a
    finite VC dimension. This is a limitation that cannot be overlooked. For
    example, the function class corresponding to the Gaussian kernel (see the
    discussion after Example 1.9 in [34] and the fact that the Gaussian kernel
    is a universal kernel), or the polynomial kernel (if no upper bound is
    imposed on the degree of the polynomials), or the one-nearest-neighbor
    classifier, or the tree-based classifier (without regularization) all
    have an infinite VC dimension. Consequently, the bound in [5] excludes
    some of the best classifiers available today, including SVM with the
    Gaussian kernel and boosting (or bagging) on tree-based classifiers,
    while our bound clearly does not have such a restriction.

4 Experiments 

Empirical studies are performed on three different types of datasets: 3
synthetic datasets, 10 UC Irvine datasets [1] and a simulated remote sensing
image. For each dataset, five different types of data contamination are
applied to the training set and the classification accuracy is evaluated on
the uncontaminated test set. SVM is used as the underlying classifier due to
its universal consistency [33] and the availability of a widely used software
implementation (libsvm [10]). The five different types of data contamination
are as follows.

$C_0$. Randomly flip the labels of a randomly selected subset of observations
from a fixed class.

$C_1$. Randomly flip the labels of a randomly selected subset of observations
from all classes.

$C_2$. Randomly select a subset of observations and replace the feature values
of each with those of a randomly chosen observation (the labels are kept).
Call this feature swapping.

$C_c$. Replace a randomly selected subset of observations with Cauchy data,
with the labels kept.

$C_g$. Replace a randomly selected subset of observations with Gaussian data,
with the labels kept.

$C_0, C_1, \ldots, C_g$ are used to simulate data contamination of different
natures.

• $C_1$ and $C_2$ are expressly designed to simulate image mis-registration;
we believe they capture important aspects of image mis-registration.

• $C_c$ and $C_g$ are used to simulate gross errors. $C_g$ is for errors of a
Gaussian nature while $C_c$ is for errors with a heavy tail, that is, the
error could be very large; this simulates accidental human error, for example,
a shift in the decimal place of a number.

• Additionally, we also attempt to simulate extremely large errors by scaling
the centers of the Gaussian and Cauchy by a factor of 100, that is, the
centers are multiplied by 100 coordinate-wise. These are denoted by
$C_{g100}$ and $C_{c100}$, respectively.
Figure 3: Scatter plot of 1000 observations generated i.i.d. from Gaussian
$\mathcal{N}(\mu, \Sigma)$ with $\mu = (1, 0)$ and $\Sigma = A^T A$, with
entries of $A$ generated i.i.d. from $\mathcal{U}[0, 1]$. Data from the two
classes are represented as diamonds and solid circles, respectively.

• $C_0$ is used to simulate a class of unfavorable situations where data
contamination occurs in part of the data space. Such cases typically make
classification more challenging. In contrast, the other simulations are more
or less average cases, as the data contamination occurs uniformly across the
whole data space.

For $C_g$, the replacement Gaussian data are generated i.i.d. from
$\mathcal{N}(\mu, \Sigma)$ with $\mu$ and $\Sigma$ calculated empirically on
the non-contaminated training set. For $C_c$, the Cauchy data are generated
i.i.d. according to

    $Z / W$, for $Z \sim \mathcal{N}(\mu, \Sigma)$, $W \sim [\Gamma(0.5, 2)]^{1/2}$

with $Z$ and $W$ independent, where $\Gamma(0.5, 2)$ is a random variable
generated from a Gamma distribution with parameters 0.5 and 2. For each run,
$\mu$ is generated uniformly from the interval $[\min(X), \max(X)]$ and
$\Sigma$ is estimated empirically from the training set.

For an illustration of the effect of these different types of data
contamination, see Figure 3 for the original data and Figure 4 for the data
after contamination of different types.
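
The contamination schemes described above can be sketched in a few lines. The
following is a minimal illustration rather than the code used for the reported
experiments: the function names are hypothetical, labels are assumed to take
values in {0, 1}, the Gamma distribution is assumed to be parameterized by
shape 0.5 and scale 2, and the Cauchy center is assumed to be drawn uniformly,
per coordinate, between the overall minimum and maximum of the feature matrix.

```python
import numpy as np

def _pick(n, eps, rng):
    """Indices of a random eps-fraction of n observations."""
    return rng.choice(n, size=int(round(eps * n)), replace=False)

def flip_labels(y, eps, rng):
    """C1-style: flip the {0, 1} labels of a random subset of observations."""
    y_new = y.copy()
    idx = _pick(len(y), eps, rng)
    y_new[idx] = 1 - y_new[idx]
    return y_new

def swap_features(X, eps, rng):
    """C2-style: overwrite features with those of randomly chosen observations."""
    X_new = X.copy()
    idx = _pick(len(X), eps, rng)
    donors = rng.integers(0, len(X), size=len(idx))
    X_new[idx] = X[donors]
    return X_new

def replace_gaussian(X, eps, rng):
    """Cg-style: replace a subset with N(mu, Sigma) draws, moments estimated from X."""
    X_new = X.copy()
    idx = _pick(len(X), eps, rng)
    mu, sigma = X.mean(axis=0), np.cov(X, rowvar=False)
    X_new[idx] = rng.multivariate_normal(mu, sigma, size=len(idx))
    return X_new

def replace_cauchy(X, eps, rng):
    """Cc-style: replace a subset with Z / W, Z ~ N(mu, Sigma), W = sqrt(Gamma(0.5, 2))."""
    X_new = X.copy()
    n, d = X.shape
    idx = _pick(n, eps, rng)
    mu = rng.uniform(X.min(), X.max(), size=d)      # assumed reading of [min(X), max(X)]
    sigma = np.cov(X, rowvar=False)
    z = rng.multivariate_normal(mu, sigma, size=len(idx))
    w = np.sqrt(rng.gamma(shape=0.5, scale=2.0, size=len(idx)))
    X_new[idx] = z / w[:, None]
    return X_new

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] > 0).astype(int)
y1 = flip_labels(y, 0.05, rng)
X2 = swap_features(X, 0.05, rng)
Xg = replace_gaussian(X, 0.05, rng)
Xc = replace_cauchy(X, 0.05, rng)
```

Each function returns a contaminated copy and leaves its input unchanged, so
the same clean training set can be reused across contamination types and
levels.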

4.1 Synthetic data 

The three synthetic datasets used in our experiment are the Gaussian mixture
data, the four-class data and the nested-square data. The Gaussian mixture
data are used to simulate cases with a linear decision boundary, while the
four-class and the nested-square datasets are for cases where the decision
boundary is highly nonlinear and non-convex. For each of the 3 datasets, we
take 80% for training and the rest for testing. Then 100 instances of data
contamination are applied and the loss in classification accuracy is averaged.
This is repeated and the results are averaged. The Gaussian kernel is used
with SVM for all three synthetic datasets.

[Figure 4 plots. Panel titles include C1 (5%), C1 (10%), Cc (5%) and
Cc (10%).]

Figure 4: Illustration of the effect of different types of data contamination.
The original Gaussian data are displayed in Figure 3. The 4 rows of plots
correspond to $C_1$, $C_2$, $C_g$, $C_c$, respectively, and figures in the
left and right columns are for data contamination at 5% and 10%, respectively.
Data from the two classes are represented as diamonds and solid circles,
respectively.

The Gaussian mixture data are generated according to the following

    $\Delta\, \mathcal{N}(\mu, \Sigma_{10 \times 10}) + (1 - \Delta)\, \mathcal{N}(-\mu, \Sigma_{10 \times 10})$

with $P(\Delta = 1) = P(\Delta = 0) = 0.5$ and
$\Sigma_{10 \times 10} = A^T A$ for entries of $A$ generated i.i.d. uniform
from $[0, 1]$, with $\mu = (0.5, \ldots, 0.5)^T$. Data points with
$\Delta = 1$ are assigned label 1 and those with $\Delta = 0$ are assigned
label 2. The sample sizes for the training set and test set are 1000 and 2000,
respectively. The loss in classification accuracy under data contamination of
different types and at different amounts is shown in Figure 5. Note that here
we are using only the first term in (9) as an estimate of the overall loss in
classification accuracy while ignoring the second term; thus when the training
sample size is not large enough, some adjustment (of the order of $O(c(n))$)
might be required.
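
The Gaussian mixture just described can be generated as in the sketch below.
This is an illustrative reading of the description, not the authors' code; the
function name `gaussian_mixture` and the random seed are assumptions.

```python
import numpy as np

def gaussian_mixture(n, d, rng):
    """Two-component mixture: Delta * N(mu, Sigma) + (1 - Delta) * N(-mu, Sigma)."""
    A = rng.uniform(0.0, 1.0, size=(d, d))      # entries i.i.d. uniform on [0, 1]
    sigma = A.T @ A                             # shared covariance A^T A
    mu = np.full(d, 0.5)                        # mu = (0.5, ..., 0.5)
    delta = rng.integers(0, 2, size=n)          # P(Delta = 1) = P(Delta = 0) = 0.5
    noise = rng.multivariate_normal(np.zeros(d), sigma, size=n)
    X = noise + np.where(delta[:, None] == 1, mu, -mu)
    y = np.where(delta == 1, 1, 2)              # label 1 if Delta = 1, else label 2
    return X, y

X, y = gaussian_mixture(1000, 10, np.random.default_rng(0))
print(X.shape, np.bincount(y)[1:])              # sample shape and class counts
```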




Figure 5: Empirical and theoretical data contamination bounds for data
generated from a Gaussian mixture with
$\epsilon \in \{0.01, 0.02, 0.03, 0.04, 0.05, 0.10\}$.





Figure 6: The four-class and nested-square data. Different colors correspond to 
points from different classes. 

The four-class and nested-square datasets were originally used to demonstrate
the superior performance of a class of projectable classifiers for data with
a highly complex decision boundary [33]. Figure 6 is a plot of these two
datasets and the data contamination bounds are shown in Figure 7. Note that
the bound as established in (9) is for 2-class classification. When there are
multiple classes, we can get a bound by repeatedly applying the 2-class bound.
Let the class distribution be denoted by $\{w_1, \ldots, w_J\}$ such that
$w_1 \ge \ldots \ge w_J$. Then we get the following multi-class bound

    $\frac{\epsilon}{1 - \epsilon} \left[ 1 + (w_2 + \cdots + w_J)\, a + \cdots + (w_{J-1} + w_J)\, a^{J-2} \right]$

where $a = 1 - \frac{\epsilon}{1 - \epsilon}$ and $\epsilon$ is the amount of
contamination. This is used as the theoretical bound in our simulations when
there are more than two classes.




[Figure 7 plots. Horizontal axes: amount of data contamination (4-class);
amount of data contamination (nested square).]

Figure 7: Empirical and theoretical data contamination bounds for the 4-class
and the nested-square datasets with
$\epsilon \in \{0.01, 0.02, 0.03, 0.04, 0.05, 0.10\}$.



4.2 UC Irvine datasets 

A total of 10 datasets are taken from the UC Irvine Machine Learning
Repository [1] in our experiment. A summary of these datasets is provided in
Table 1 and more details can be found in [1].

Table 1: Summary of the UC Irvine datasets used in our experiment.

Dataset             Training   Testing   Features   Classes
imageSeg                 210      2100         19         7
Vowel                    528       462         10        11
Satellite images        4435      2000         36         6
Glass                    214         -         10         6
Vehicle                  946         -         18         4
German credit           1000         -         24         2
Yeast                   1484         -          8        10
Wine quality            1599         -         11         6
Musk                    6598         -        168         2
Magic gamma            19020         -         10         2

Some datasets come with predetermined training and test sets; these include
the image segmentation, vowel and satellite image datasets. Otherwise
we split the data into a training and a test set. For small to medium-sized datasets,
i.e., Glass, Vehicle, German Credit, Yeast and Wine Quality (red wine), we take
80% of the data for training and the rest for testing. For large datasets, i.e.,
Musk and Magic Gamma Telescope, 20% and 10% of the data, respectively,
are set aside for training and the rest for testing. For each dataset, 100 instances
of data contamination are applied to the training set and the resulting data
contamination bounds are averaged. This is repeated and the results are averaged.
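The contamination-and-averaging loop can be sketched as follows for the label-flipping case. This is only an illustration under stated assumptions: the helper names, the seed, and the `score` callback (standing in for "train a classifier on the contaminated labels and measure its loss") are hypothetical, and the fraction of flipped labels is rounded to the nearest integer count.

```python
import random


def flip_labels(labels, eps, classes, rng):
    """Return a copy of `labels` in which a fraction `eps` is flipped.

    Each selected label is replaced by a class drawn uniformly from the
    alternatives, so a selected label always changes.
    """
    n_flip = round(eps * len(labels))
    contaminated = list(labels)
    for i in rng.sample(range(len(labels)), n_flip):
        alternatives = [c for c in classes if c != labels[i]]
        contaminated[i] = rng.choice(alternatives)
    return contaminated


def mean_over_contaminations(labels, eps, classes, score, n_rep=100, seed=0):
    """Average `score(contaminated_labels)` over `n_rep` contamination instances."""
    rng = random.Random(seed)
    scores = [score(flip_labels(labels, eps, classes, rng)) for _ in range(n_rep)]
    return sum(scores) / n_rep


# Toy usage: the "score" here is just the observed fraction of changed labels.
clean = [i % 3 for i in range(1000)]
changed = lambda y: sum(a != b for a, b in zip(clean, y)) / len(y)
print(mean_over_contaminations(clean, 0.05, [0, 1, 2], changed))
```

In an actual experiment `score` would wrap the training and evaluation of the classifier; the toy score above only confirms that the requested amount of contamination is applied.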

The Gaussian kernel is used for all datasets except the image segmentation dataset,
where a polynomial kernel of degree 3 is used. Tuning parameters for SVM
are chosen so that the classification performance matches that reported in the
literature (see, for example, references cited in the description of each dataset
in [1]). Some datasets are linearly scaled to [0, 1] so as to speed up the painfully
slow optimization of the SVM package; these are the Musk, Magic Gamma,
Satellite image, Vehicle, and Wine quality datasets. The data contamination
bounds by SVM on the UC Irvine datasets are plotted in Figure 8.
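One common way to scale data linearly to [0, 1] is per-feature min-max scaling, sketched below. Whether the scaling was applied per feature or globally is not stated here, so the per-feature choice and the function name are assumptions.

```python
def scale_to_unit_interval(column):
    """Linearly map the values of one feature onto [0, 1] (min-max scaling)."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant feature carries no information; map it to 0 by convention.
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


print(scale_to_unit_interval([2.0, 4.0, 6.0]))
```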




Figure 8: Empirical and theoretical data contamination bound for UC Irvine
datasets with $\epsilon \in \{0.01, 0.02, 0.03, 0.04, 0.05, 0.10\}$ (only 6 of them are
shown here so that they fit on the same page; the rest are similar).
4.3 Remote sensing image

The remote sensing image used in the experiment covers a cropland with 5
different land-use classes. The image size is 596 by 529 pixels. The features
of interest are taken from the annual vegetation index time series (see Figure 9)
at an interval of 30 days; 10 of them are used, each corresponding
to one scene of the image at a different time of the year. The vegetation index is
an optical measure of vegetation canopy greenness and is closely related to the
photosynthetic potential of plants. For each pixel, random noise generated
from the Gaussian $\mathcal{N}(0, 0.1^2)$ is applied.



[Plot: vegetation index (0.0 to 1.0) against day of the year (50 to 250).]



Figure 9: The annual vegetation index. The x-axis is the day of a year and 
different colors indicate different land classes. 

To simulate the acquisition of remote sensing images, the following procedure
is performed on each of the 10 scenes of the image.

1. Rotate all images clockwise by 10 degrees.

2. Re-sample each scene of the image using a randomly generated offset from
$\mathcal{N}(0, 0.1^2)$.

3. Remove the blank edges in all images that are caused by rotation and
re-sampling.

In Step 2 above, the offsets are drawn from the Gaussian distribution given there and
bilinear interpolation [19] is applied during re-sampling. As a result, 247 by 233
pixel multi-temporal vegetation index images for the cropland of interest
are generated.
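The re-sampling in Step 2 can be illustrated with a small pure-Python sketch of bilinear interpolation at a sub-pixel offset. The rotation and edge-removal steps are not shown. The function names are hypothetical, and clamping coordinates at the image border is an assumption made here to keep the example short (the procedure above removes blank edges instead).

```python
import math
import random


def bilinear(img, y, x):
    """Bilinearly interpolate `img` (a list of rows) at real-valued (y, x)."""
    h, w = len(img), len(img[0])
    # Clamp to the image so that every lookup below is valid (assumption).
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    fy, fx = y - y0, x - x0
    top = (1 - fx) * img[y0][x0] + fx * img[y0][x1]
    bottom = (1 - fx) * img[y1][x0] + fx * img[y1][x1]
    return (1 - fy) * top + fy * bottom


def resample_with_offset(img, dy, dx):
    """Re-sample every pixel of `img` at its position shifted by (dy, dx)."""
    return [[bilinear(img, r + dy, c + dx) for c in range(len(img[0]))]
            for r in range(len(img))]


# One scene, shifted by a random offset with standard deviation 0.1 pixel.
rng = random.Random(0)
scene = [[0.0, 1.0], [2.0, 3.0]]
shifted = resample_with_offset(scene, rng.gauss(0, 0.1), rng.gauss(0, 0.1))
```

With a zero offset the scene is returned unchanged, and sampling at the center of the four pixels gives their average.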

To assess the impact of image mis-registration on the task of classification,
two mis-registered images (corresponding to Case I and Case II in Table 2,
respectively) are generated under different levels of mis-registration (roughly
corresponding to 3% and 4% data contamination, respectively). The SVM classifier
is trained on a sample from the original image and from the mis-registered image,
respectively, and then tested on a sample taken from the original image. We use
the data in a fashion similar to 5-fold cross-validation, i.e., 4 folds are selected
for training and the rest for testing. Table 2 reports the classification accuracy. We
can see that, in both cases, the loss in classification accuracy is small and is
well bounded by our theoretical prediction.

Table 2: Accuracy (%) of SVM for the cropland remote sensing image under different
amounts of image mis-registration. Each of the first 5 columns corresponds to
one of the 5 folds.

Fold            1       2       3       4       5   Average
Original    98.13   98.14   97.90   97.94   97.98     98.02
Case I      98.08   98.11   97.92   97.94   97.98     98.01
Case II     98.09   98.10   97.92   97.92   97.96     97.99

It is known (for example, via the bootstrap) that the effect of mis-registration on
image classification varies with the relative size of the ground area corresponding
to an image pixel (call this the pixel size) and the actual homogeneity of an area
(larger numbers correspond to more homogeneity). If the ratio of these two
numbers is small, then the damage from mis-registration is small; otherwise it is
large. Since we are using a crop field here and the corresponding pixel size is
much smaller than the crop field, the effect of data contamination is
small. If the pixel size is close to the actual object size, then a mis-registration of
half a pixel may cause more damage.

4.4 Some empirical results on AdaBoost

So far SVM has been used as the underlying classifier in our experiments, but other
universally consistent classifiers such as AdaBoost are applicable as well. Instead
of repeating the experiments for AdaBoost, we collect results found in the
literature [ITl [161 H] and summarize them in Table 3. Note that here we simply adopt the
existing results, which corresponds to taking $\epsilon = 0.05$ only.

Table 3: Error rates of AdaBoost on some UC Irvine datasets where 90% of the
data are used as the training set. Results are shown for the original data and
when 5% of the class labels in the training set are randomly flipped (uniformly
into an alternate class). Results are adopted from the references cited in the
text above and then converted.

Dataset          Original data   5% labels flipped   Difference
Glass                   22.00%              22.35%        0.35%
Breast cancer            3.20%               4.58%        1.38%
Diabetes                26.60%              28.41%        1.81%
Sonar                   15.60%              17.96%        2.36%
Ionosphere               6.40%               8.17%        1.77%
Soybean                  7.57%               9.61%        2.04%
Ecoli                   14.80%              15.91%        1.11%
Votes                    4.80%               7.14%        2.34%
Liver                   30.70%              33.86%        3.16%
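As a quick arithmetic check, the observed increases in error rate can be compared against the two-class value $\epsilon/(1-\epsilon)$ at $\epsilon = 0.05$, which is about 5.26%. The sketch below uses only the numbers in Table 3; the variable and function names are illustrative, and for datasets with more than two classes the multi-class form of the bound would be the relevant comparison.

```python
# Error rates (%) copied from Table 3: (original data, 5% labels flipped).
TABLE3 = {
    "Glass": (22.00, 22.35),
    "Breast cancer": (3.20, 4.58),
    "Diabetes": (26.60, 28.41),
    "Sonar": (15.60, 17.96),
    "Ionosphere": (6.40, 8.17),
    "Soybean": (7.57, 9.61),
    "Ecoli": (14.80, 15.91),
    "Votes": (4.80, 7.14),
    "Liver": (30.70, 33.86),
}


def two_class_bound(eps):
    """Asymptotic two-class bound eps / (1 - eps) on the loss in accuracy."""
    return eps / (1.0 - eps)


bound_pct = 100.0 * two_class_bound(0.05)
losses = {name: flipped - original for name, (original, flipped) in TABLE3.items()}
largest = max(losses, key=losses.get)

for name, loss in losses.items():
    print(f"{name:15s} loss = {loss:5.2f}%  below two-class value: {loss < bound_pct}")
```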
4.5 Estimating the amount of data contamination

Using the data contamination bound (9), we can estimate the loss in accuracy for
classifiers trained with contaminated data. The remaining question is how to obtain a
(rough) estimate of the amount of data contamination. In general, this is a question we
would like to leave to future work.

In the special case of image mis-registration, we propose two simple heuristics
for estimating the amount of data contamination. Both are based on the observation
that the image pixels affected by mis-registration are roughly those near the
boundary between different land classes. Thus the proportion of boundary pixels
serves as a good indication of the amount of data contamination. Here the
underlying assumption is that the proportion of boundary pixels is roughly
the same in the true and the mis-registered images.

One approach is based on sampling. A number of pixels, say 100 to 200,
are randomly sampled from the image; we then count the proportion of pixels
that fall on the boundary by visual inspection. Another estimate is based on
the classification results of a classifier trained on the contaminated data. For
each pixel, we determine whether it is on the boundary by the following heuristic.
Take a 3 x 3 patch centered on the pixel. If there are at least
two pixels within the patch having a different class label from the rest, then
declare the pixel at the center of the patch to be on the boundary.
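The 3 x 3 patch heuristic can be sketched as below. Two points are assumptions rather than part of the description above: "different from the rest" is read as "different from the most common label in the patch", and pixels on the image border, which have no full 3 x 3 patch, are left out of the count. The function names are hypothetical.

```python
from collections import Counter


def is_boundary(labels, r, c, min_minority=2):
    """Apply the 3 x 3 patch rule to the interior pixel at row r, column c."""
    patch = [labels[i][j] for i in (r - 1, r, r + 1) for j in (c - 1, c, c + 1)]
    majority_count = max(Counter(patch).values())
    # Pixels whose label differs from the most common label in the patch.
    return len(patch) - majority_count >= min_minority


def boundary_fraction(labels):
    """Proportion of interior pixels declared to be on a class boundary."""
    h, w = len(labels), len(labels[0])
    interior = [(r, c) for r in range(1, h - 1) for c in range(1, w - 1)]
    if not interior:
        return 0.0
    hits = sum(is_boundary(labels, r, c) for r, c in interior)
    return hits / len(interior)


# Toy predicted label map: class 0 on the left, class 1 on the right.
label_map = [[0, 0, 0, 1, 1] for _ in range(5)]
print(boundary_fraction(label_map))
```

The returned fraction plays the role of the rough estimate of the amount of data contamination discussed above.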

5 Conclusion and discussion 

We formulate the problem of image mis-registration as data contamination and
equip it with a statistical model. This model captures a very general class of
errors, for instance, measurement errors and gross errors that can be formulated
as label flipping, feature swapping, or feature replacement by any proper
distribution. Under a statistical learning theoretical framework, we derive an
asymptotic bound for the loss in classification accuracy due to data contamination.
One nice feature of this bound is that it is essentially distribution-free
and thus applies to all different types of data. Extensive simulations on both
synthetic and real datasets under various types of data contamination show that
the data contamination bound we derive is fairly tight. Compared to similar
bounds in the domain adaptation literature, our bound is sharper and, unlike
such bounds, it applies to classifiers with an infinite VC dimension.

As we have already discussed, our data contamination model can capture
various types of errors such as image mis-registration, label noise and accidental
human errors. Beyond that, we can also use data contamination as a useful
device. We give here an example in the setting of co-training ([lUl ISl ITT]).
Empirically, it has been shown that co-training can significantly boost the
classification accuracy when the training sample size is extremely small, e.g., 12 in
[6] for web page classification and 6 in [29] for newsgroup classification.
Theoretical work has been carried out to understand the success of co-training (see,
for instance, [HlIIl]). We provide here a different perspective.

In co-training, starting from a small number of labeled examples, the algorithm
progressively enlarges the labeled set by transferring those examples
which are originally unlabeled but are classified with high confidence by the
classifier built from the labeled data available so far. This amounts to enlarging
the labeled set with a small amount of label noise; the label noise here is
small because the examples being transferred are classified with
high confidence. Assume that at a certain point we have $n$ examples in the labeled set
and that $n$ is large. Then, by our analysis (cf. (9)), the additional classification
error with respect to that resulting from a clean labeled set (of size $n$) is no more
than $\epsilon/(1-\epsilon) + O(c(n))$ for $c(n) \to 0$ as $n$ grows. Thus,

$$\mathrm{Err}(\text{Bayes classifier on } G) \le \mathrm{Err}(\text{Classifier learned on } n \text{ observations from } \tilde{G}) \le \mathrm{Err}(\text{Bayes classifier on } G) + \frac{\epsilon}{1-\epsilon} + O(c(n)), \qquad (10)$$

where Err denotes the error rate. Here, we use $G$ and $\tilde{G}$ to denote the data
with clean labels and that containing labels assigned by the co-training algorithm,
respectively. It is clear that the error rate achieved by co-training equals that of
a classifier learned on $n$ observations from $\tilde{G}$. However, the error rate of a
classifier learned on $l$ labeled examples from $G$ is typically
much larger, i.e.,

$$\mathrm{Err}(\text{Classifier learned on } l \text{ examples from } G) \gg \mathrm{Err}(\text{Bayes classifier on } G) + \frac{\epsilon}{1-\epsilon} + O(c(n)) \qquad (11)$$

if $l$ is small, $\epsilon$ is small and $n$ is large. The gap between the two Err terms in
(11) is the potential "benefit" of co-training, as illustrated in Figure 10. This
explains why co-training may be feasible with a small number of initial labeled
examples. Since this gap shrinks as $l$ increases, this, on the other hand,
explains why co-training may not help much when the initial labeled set is large.

[Figure 10 diagram labels: $\mathrm{Err}^*(G) + \epsilon/(1-\epsilon) + O(c(n))$; $\mathrm{Err}^*(G)$; "Benefit of co-training"; Err(Train on $n$ examples from $\tilde{G}$); Err(Train on $l$ examples from $G$).]

Figure 10: The benefit of co-training. Err* denotes the Bayes error rate.

A limitation of our data contamination model (1) is that, in modeling the
phenomenon of image mis-registration with a data contamination model, i.i.d.
contaminations are assumed. However, in practice the mis-registered image
pixels may be correlated in some way. It is thus desirable to take this into
account in the model, which we shall leave to future work. Note that we derive
the data contamination bound under a general class of data distributions; it is
desirable to take advantage of knowledge of the underlying distribution to get
a sharper bound. Note also that the focus of the present paper is the analysis
and simulation of the impact of data contamination on classification accuracy;
no new algorithm is proposed. We shall leave that to future work. Interested
readers can see, for example, ZV. and references therein.